Summary

A complete modeling of faults at gate level for a fault-tolerant computer
is both infeasible and uneconomical. Functional Fault Modeling is an approach
where units are characterized at an intermediate level and then combined to
determine fault behavior. This report is a preliminary study on the
applicability of Functional Fault Modeling to the FTMP. Using this model, a
forecast of error latency is made for some functional blocks. This approach
may be useful in representing larger sections of the hardware and may aid in
uncovering system-level deficiencies.

Introduction

The complexity of a fault-tolerant computer makes it impractical to
exhaustively estimate its reliability parameters using a low level fault model
(1, 2). A particular logic function may be implemented in different forms as a
technology matures, or separate vendors may select different yet equivalent
forms. Even in current MSI circuits, equivalent logic specifications are used
rather than detailed transistor representations. A higher level approach is
Functional Fault Modeling (3-5). In a functional model, gates cease to be the
primitives in analysis. The whole hardware structure is partitioned into
functional primitives. The partitioning is based on many factors. Whatever
partitioning was used in the design process is usually the starting point for
functional fault modeling. Faults occurring internal to a partition propagate
to the outputs of the partition (all gate level faults will manifest as some
functional level faults). This mapping of faults is many to one, thus making
functional level modeling potentially simpler.

If the technology is well understood then the functional model can be
made to accurately represent the partition both in range of behavior and in
statistical characteristics. A library of models can be developed including a
hierarchy of units (e.g., flip-flop models used to develop a counter model,
etc.). When the technology is not well understood, the functional modeling can
be done in a more conservative manner (6-11). All conceivable fault behavior
could be represented; then, as actual behavior data becomes available, the
functional model can be made more accurate.

Because of the highly redundant nature of the FTMP, it is difficult to
surface many of its faults during its normal operation. But FTMP behaves as a
fault secure circuit for most of the faults. Eventually, when that redundant
portion of the circuit is exercised, the fault effect propagates to the outputs
of the partition. Since we know the circuit structure it is easy to break an
existing functional primitive into smaller primitives and improve the accuracy
of the model.

FACTORS INFLUENCING ERROR LATENCY 

In FTMP, error latency is influenced by both hardware and software. Faults on
a single active bus line are masked by the voters and, except in the BGUs,
the error decoding circuits detect these errors.

Three primary factors influence error latency:

1. Whether the faulty circuit is exercised,

2. The rate at which the error latches are read, and

3. Whether the faulty part influences the detection circuitry.

Each of the above cases needs to be analysed carefully to arrive at an
accurate overall latency estimate.

1. The first factor is a natural consequence of the highly redundant
nature of the FTMP. If the fault occurs in an inactive region it is harmless,
but another fault at this juncture might cause abnormal behaviour. An example
of this is a fault on one of the spare buses as the first fault, followed by a
fault on an active bus. This might result in replacing all the units enabled
on the faulty active bus by another faulty bus. Subsequent detection of the
second fault might take several cycles. It is imperative to periodically
activate all the redundant parts to avoid long latency faults. Some
assumptions were made regarding exercising the redundant parts to arrive at
definitive figures for the error latency. The particular assumptions are
given later.

2. The second factor is software determined. A program called SCC
(System Configuration Controller) performs the chore of reading and
interpreting the error latches. The dispatcher directs SCC to the least
loaded triad. The first part of the program determines whether any errors
were reported in the preceding frame. It also determines whether the error
reported during a previous frame has been corrected. If not, it waits for a
maximum of four cycles for the previous error to be corrected. An assumption
is made in this regard to make definitive forecasts.

3. Some parts like the BGUs do not have error detection circuitry. 
Single faults here are often masked although all parts may be active. These 
latent faults have to be dealt with specially. Normally there is no way to 
propagate many of the single faults in the voting hardware and deskewers 
unless we have another fault in such a way that they cooperate to cause 
noticeable faulty output behaviour. 

DESCRIPTION OF DETECTION PROCESS AT HIGH LEVEL 

It may be recalled that all tasks run at one of the three preassigned
rates. The assumptions made in this paper regarding the rates are:

1. R1 rate is 3.125 Hz (320 msecs).

2. R3 rate is 12.5 Hz (80 msecs).

3. R4 rate is 25 Hz (40 msecs).

SCC, the high level fault handling program, runs at R1 rate, i.e., every 320
msecs this program processes the error information supplied by hardware. SCC
also aids in surfacing the faults by running self testing programs and
activating spare units at regular intervals. We can summarize the fault
detection process as the arrival of disagreement at the voters of a triad,
stimulated by normal activity or test activity. Test activity includes self
testing and spare cycling phases. The detection of faults initiates fault
identification and later reconfiguration. The identification and
reconfiguration are done with the help of special procedures initiated by
SCC. To have an accurate prediction of fault detection times for various
faults, it is important to know how the tasks constituting normal activity and
test activity are dispatched. The flow chart in Figure 1 explains SCC's
dispatching strategy at a high level. We see that test activity is dependent
on a parameter set in software (Time to Cycle). Whenever a swap command is
executed this parameter is initialised. In the current configuration Time to
Cycle will be true every 5 seconds. When this boolean value is true, spare
cycling is done. The purpose of test activity is to propagate the effects of
any faults in spare units to the error latches so that SCC detects them.


DESCRIPTION OF SELF TESTING IN FTMP 

Self test programs are run to detect some of the latent faults in error
latches, voters, error decoding circuitry, and cache PROMs of the FTMP. SCC
calls the master self test program (SELF_TEST) which in turn calls one of the
38 self tests. Each self test is designed to test a specific unit and in each
run of SCC only one of the tests is invoked. The P, R, and T tests are
invoked if three corresponding bus lines are active. The C test is performed
if 4 clock lines are active. Essentially, in the P, R, T, and C tests a
disagreeing input stream is fed on one of the active lines and the error latch
contents are checked to see whether the injected fault is reported. In the PROM
test a checksum verification is made on different segments. To simplify the
analysis some assumptions were made regarding the self tests.

ASSUMPTIONS ON SELF TESTS

As can be seen from the flow chart in Figure 2, in some cycles SCC does not
invoke self tests. Since self testing runs much more frequently than spare
cycling, it is assumed that self tests run periodically without interruption
from spare cycling. This will give estimates which are lower bounds on
average times. Based on this, the assumptions ST1 and ST2 are made.

ST1. Self tests are run at R1 rate, i.e., every 320 msecs.

ST2. Time to complete one cycle of self tests = 38 * 320 msecs = 12.160
secs. The order is given in Figures 2 and 3.

Occurrence of a fault is random with respect to the self test cycle. The
segmented testing employed makes the detection time vary between zero (when
occurrence of a detectable fault is immediately followed by its detecting self
test) and the time taken to complete the whole cycle (when the detecting self
test runs just prior to the occurrence of its detectable fault). ST3 follows
from the above.

ST3. Mean time to detection of any fault which can be detected in one
self test cycle = 38 * 320/2 msecs = 6.08 secs.
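
The arithmetic behind ST2 and ST3 can be checked with a short sketch. It
assumes, as the text does, that a fault arrives at a time uniformly spread
over the self test cycle and that each test occupies one fixed R1 slot. The
names used here (R1_PERIOD_S, latency_s, mean_latency_s) are illustrative
only and are not taken from the FTMP software.

```python
R1_PERIOD_S = 0.320                      # ST1: one self test per R1 frame
NUM_SELF_TESTS = 38                      # entries in the self test schedule
CYCLE_S = NUM_SELF_TESTS * R1_PERIOD_S   # ST2: length of one full cycle


def latency_s(fault_time_s, slot):
    """Delay from fault occurrence until the test in `slot` (0-based) next runs."""
    return (slot * R1_PERIOD_S - fault_time_s) % CYCLE_S


def mean_latency_s(slot, samples=38000):
    """Average latency for fault times spread evenly over one cycle."""
    step = CYCLE_S / samples
    total = sum(latency_s((k + 0.5) * step, slot) for k in range(samples))
    return total / samples


if __name__ == "__main__":
    print(f"cycle time   : {CYCLE_S:.2f} s")
    print(f"mean latency : {mean_latency_s(slot=5):.2f} s")
```

The mean does not depend on which slot holds the detecting test, which is
why ST3 can quote a single figure for any fault covered by the cycle.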

DESCRIPTION OF SPARE CYCLING IN FTMP 

FTMP has processors, memories, and buses in its spare pool. Units from
the pool can be brought on line to replace any failed active unit. To uncover
latent faults, spares are periodically brought on line even if all active units
are functioning correctly. Spares are assigned as shadows to active triads.
A shadow essentially duplicates the activities of the triad it is assigned to
track and differs from active units in its access priorities to the bus lines
(e.g., a processor shadow cannot participate in polling, thus denying it
access to the transmit bus). But shadows watch the R and C bus lines to

Schedule of Self Tests

 1. Reset self test state.
 2. POLL, ANY TRIAD, LOW ORDER
 3. PROM, TRIAD1, 1800
 4. PROM, TRIAD2, 1800
 5. PROM, TRIAD3, 1800
 6. R, TRIAD1, LOW ORDER
 7. R, TRIAD2, LOW ORDER
 8. R, TRIAD3, LOW ORDER
 9. PROM, TRIAD1, 1A00
10. T, TRIAD1, LOW ORDER
11. T, TRIAD2, LOW ORDER
12. T, TRIAD3, LOW ORDER
13. C, ANY TRIAD, LOWEST ORDER
14. POLL, TRIAD1, MIDDLE ORDER
15. PROM, TRIAD2, 1A00
16. PROM, TRIAD3, 1A00
17. PROM, TRIAD1, 1C00
18. R, TRIAD1, MIDDLE ORDER
19. R, TRIAD2, MIDDLE ORDER
20. R, TRIAD3, MIDDLE ORDER
21. PROM, TRIAD2, 1C00
22. T, TRIAD1, MIDDLE ORDER
23. T, TRIAD2, MIDDLE ORDER
24. T, TRIAD3, MIDDLE ORDER
25. C, ANY TRIAD, 2ND LOW
26. POLL, TRIAD1, HIGH ORDER
27. PROM, TRIAD3, 1C00
28. PROM, TRIAD1, 1E00
29. PROM, TRIAD2, 1E00
30. R, TRIAD1, HIGH ORDER
31. R, TRIAD2, HIGH ORDER
32. R, TRIAD3, HIGH ORDER
33. PROM, TRIAD3, 1E00
34. T, TRIAD1, HIGH ORDER
35. T, TRIAD2, HIGH ORDER
36. T, TRIAD3, HIGH ORDER
37. C, ANY TRIAD, 2ND HIGH
38. C, ANY TRIAD, HIGH ORDER

Figure 3. Schedule of self test program units

maintain synchronism. To get them on-line, only the BGUs of the shadow and
the unit it is replacing have to be written into. SCC calls a procedure
ISSUE-SWAP-OIND to perform spare cycling. This is issued whenever Time to
Cycle is true (see flow chart). This procedure determines which spare unit
should be brought on-line and which unit it should replace. On each pass only
one spare unit is swapped. The order in which spare cycling is done is given
in Figure 4.

ASSUMPTIONS ON SPARE CYCLING 

In the present system Time to Cycle is true every 5 secs in the absence
of reported faults. There is an interaction between spare cycling and
self testing because both of them cannot be done in a single cycle. Spare
cycling occurs at least twice in a cycle of self tests. Since the spare
cycling rate is much smaller in comparison to self testing, we assume both
self testing and spare cycling occur periodically without any interaction.
This will result in a more optimistic estimate or a lower bound on the
latency. Moreover, the faults which can be propagated by spare cycling are
configuration dependent. Cycling turns inactive units of a particular
configuration into active units by changing the system configuration. This
allows otherwise latent faults to be detected in subsequent normal activity.
The swapping of processors and memories depends on the replacement policy.
Swapping of bus lines is easier to visualise as the maximum number of spare
bus lines of a particular type can be two. The following assumptions are made
regarding the rate and mode of spare cycling.

SC1. Spare cycling is done every 5 secs.

SC2. Spare cycling follows the order shown in Figure 4.

SC3. After the decision on which triad to issue the swap command to is made,
the units are swapped using the LRU (Least Recently Used)
algorithm. Here the unit which will be used as the replacement is
the one which has been longest in the spare pool for that triad,
and the unit it will replace is the one which has been active for
the longest time.

Figures 5a and 5b illustrate this assumption as applied to processors and
buses respectively. We have illustrated the case of 12 processors with no
failed processors or buses.
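
Assumption SC3 can be sketched as a small simulation for the 12 processor
case of Figure 5(a). The sketch assumes one swap per pass, triads visited in
round-robin order, and one spare per triad; the data layout and the names
make_triad, lru_swap, and run_swaps are hypothetical and only serve the
illustration.

```python
from collections import deque


def make_triad(active, spares):
    # In each deque the leftmost unit has held its role the longest.
    return {"active": deque(active), "spare": deque(spares)}


def lru_swap(triad):
    """Bring in the longest-idle spare and retire the longest-active unit."""
    incoming = triad["spare"].popleft()
    outgoing = triad["active"].popleft()
    triad["active"].append(incoming)
    triad["spare"].append(outgoing)
    return incoming, outgoing


def run_swaps(triads, swaps):
    """Issue one swap per pass, visiting the triads in round-robin order."""
    history = []
    for step in range(swaps):
        index = step % len(triads)
        incoming, outgoing = lru_swap(triads[index])
        history.append((index + 1, incoming, outgoing))
    return history


triads = [
    make_triad([1, 2, 3], [10]),
    make_triad([4, 5, 6], [11]),
    make_triad([7, 8, 9], [12]),
]
history = run_swaps(triads, 12)

if __name__ == "__main__":
    for triad_no, incoming, outgoing in history:
        print(f"triad {triad_no}: unit {incoming} replaces unit {outgoing}")
```

After twelve swaps every processor has spent time as a shadow and the
membership of each triad is back to its starting state, which corresponds to
one major cycle. The sketch tracks membership only, not the order in which
the figure prints the units.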


FAULT CLASSIFICATION 

It is clear that SCC detects many different types of faults and it uses
various programs to uncover them. Classification of faults is based on the
type of faults which would be detected by a particular detection program.
Faults which go undetected are also classified. A level tree diagram (Figure
6) is given which describes the nature and difficulty in detecting a fault.

Class 1: This corresponds to those faults which become visible during
normal activity in the form of disagreement at a voter propagated to
the error latches.

Class 2: Faults in redundant hardware, PROMs, and error latches which do not
show up in normal activity belong to this class. Self tests uncover
these faults.

Class 3: Faults in inactive units serving as spares belong to this class.
Spare cycling uncovers these faults.

Class 4: Faults which go undetected.

EXAMPLE TO DEMONSTRATE FUNCTIONAL FAULT MODELING

In this section we choose a specific unit for functional fault
modeling. A similar procedure can be adopted to compute the fault
detection times for any other unit. First, a broad classification of the
functional misbehavior patterns of the unit in question is made. The
classification should be realistic in the sense that physical or logical gate
level faults could produce that kind of functional misbehavior. After having
defined the misbehavior patterns, one can categorize them into the fault
classes discussed in the previous sections depending upon how and when they
set the system error latches. To obtain the distribution of various faults
amongst the fault classes, one has to identify the number of faults which
result in a particular misbehavior pattern. Then, by exhausting all possible
misbehavior patterns, one can obtain the resulting distribution of faults
belonging to each class. This distribution, coupled with the assumptions on
the fault detection software made in earlier sections, can be used to obtain
curves for fault detection times.

We choose the 3/5 input-select unit given in Figure 7, which provides a
representative partitioning of the fault classes. This unit is present in all
the bus interfaces and BGUs. Its function is to select 3 out of 5 input
lines based on the 4 bit select code supplied by the control register (see
Table 1). Regardless of the exact nature of the implementation, the following
functional fault modeling approach can be taken.

MB1. Selects less than three active lines (assuming no duplication of
any active line).

MB2. Duplicated active line comprises two of the three active lines
selected.

MB3. Faults which result in functional misbehavior only when the select
code or system configuration is changed (faults here manifest as
MB1 or MB2 when configuration is changed).

MB4. Faults which cannot be detected.

Gate level faults can be mapped into the above functional fault
classification.

Except for the BGUs, all other 3/5 select units are used in conjunction
with a voter-error ROM combination, and MBi lines up with Class i. More
generally, the functional classes MBi fall into the earlier overall fault
classes depending upon the location of the 3/5 select unit. If the select unit
is located in a spare then MB1, MB2, and MB3 are in Class 3, faults in inactive
units. When the select unit is in a BGU, MB1 falls into Class 2, which
requires special testing.
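
The location dependence described above can be written as a small lookup.
This is a minimal sketch: the location labels "active", "spare", and "bgu"
and the function name fault_class are hypothetical, and any combination the
text does not assign is returned as None rather than guessed.

```python
def fault_class(mb, location):
    """Map misbehavior pattern MB1..MB4 (given as 1..4) to a fault class.

    Returns None where the text does not state the assignment.
    """
    if mb not in (1, 2, 3, 4):
        raise ValueError("mb must be 1, 2, 3, or 4")
    if location == "active":
        # Non-BGU unit used with a voter and error ROM: MBi maps to Class i.
        return mb
    if location == "spare":
        # MB1, MB2, and MB3 in a spare are all faults in inactive units.
        return 3 if mb in (1, 2, 3) else None
    if location == "bgu":
        # Only MB1 is assigned explicitly for the BGU case.
        return 2 if mb == 1 else None
    raise ValueError("unknown location: " + repr(location))


if __name__ == "__main__":
    for location in ("active", "spare", "bgu"):
        print(location, [fault_class(mb, location) for mb in (1, 2, 3, 4)])
```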

To illustrate the many to one mapping we take a specific implementation
of the 3/5 select (T BUS INTERFACE) and map the gate level faults to MB1, MB2,
MB3, and MB4 (Class 1, Class 2, Class 3, and Class 4 respectively). Some faults
are data dependent and map to different fault behavior for different
implementations. To illustrate this, let the initial select code correspond
to lines 1, 2, and 4 being active. If, owing to some fault, lines 1, 3, and 3
are selected, then clearly it falls to MB1 in, say, the T bus interface. But if
the initial configuration changes from 1, 2, and 3 to 1, 3, and 3 due to a
fault, then the fault falls to MB2. The duplication of a line for a select code
is implementation dependent. Therefore it is better to consider implementation
details to obtain a finer model. Assuming that all possible errors can occur
in the select code, we tabulate the faults. Since the select code at a
particular instant of time is not known, an averaging technique is used. We
assume that all valid codes are equally likely. By this we get the distribution
of select code faults into various classes (see Table 2).
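
The two cases in the example above can be reproduced with a short check of
the MB1 and MB2 definitions. This is only a sketch of the definitions as
worded in this section; the function name misbehavior and the return label
"OK" for a correct selection are assumptions made for illustration.

```python
from collections import Counter


def misbehavior(selected, active):
    """Classify a selection of three lines against the intended active set."""
    counts = Counter(selected)
    if any(n >= 2 and line in active for line, n in counts.items()):
        return "MB2"  # a duplicated active line fills two of the three slots
    if len(set(selected) & set(active)) < 3:
        return "MB1"  # fewer than three active lines, no active line duplicated
    return "OK"       # all three active lines selected


if __name__ == "__main__":
    print(misbehavior((1, 3, 3), {1, 2, 4}))  # line 3 is not active here
    print(misbehavior((1, 3, 3), {1, 2, 3}))  # line 3 is active here
```

The same faulty selection lands in a different class depending on which
lines were meant to be active, which is the data dependence the text points
out.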


Select Code

INITIAL CODE    # CLASS 1    # CLASS 2    # CLASS 3    # CLASS 4

0000               15            0            0            0
0001               15            0            0            0
0010               13            2            0            0
0011               13            2            0            0
0100               13            2            0            0
0101               13            2            0            0
0110               15            0            0            0
0111               15            0            0            0
1000               10            4            1            0
1111               14            0            1            0

Average            13.6          1.2          0.2          0

Table 2. Select code input error behavior
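
The averaging step can be checked directly from the rows of Table 2. The
sketch below assumes the rows as transcribed here, with the last code read
as 1111 by analogy with Table 3; the names TABLE2 and class_averages are
illustrative.

```python
# Fault counts per initial select code: (Class 1, Class 2, Class 3, Class 4).
TABLE2 = {
    "0000": (15, 0, 0, 0),
    "0001": (15, 0, 0, 0),
    "0010": (13, 2, 0, 0),
    "0011": (13, 2, 0, 0),
    "0100": (13, 2, 0, 0),
    "0101": (13, 2, 0, 0),
    "0110": (15, 0, 0, 0),
    "0111": (15, 0, 0, 0),
    "1000": (10, 4, 1, 0),
    "1111": (14, 0, 1, 0),  # code reconstructed from a garbled entry
}


def class_averages(table):
    """Average each class column, treating all valid codes as equally likely."""
    rows = list(table.values())
    return [sum(column) / len(rows) for column in zip(*rows)]


averages = class_averages(TABLE2)
row_totals = {code: sum(row) for code, row in TABLE2.items()}

if __name__ == "__main__":
    print("averages  :", averages)
    print("row totals:", sorted(set(row_totals.values())))
```

Every row sums to 15, which is what one would expect if each of the 15
other values of a 4 bit code is counted once as a possible erroneous code.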

In a similar vein we can get the distribution of input and output line stuck
faults. Here we assume that the select code is not faulty and faults manifest
on input and output lines. Faults assumed for input and output signal lines
are stuck-at-1, stuck-at-0, or inversion of the signal line. The number of
fault cases for each legal code is given in Table 3.


SELECT CODE     # CLASS 1    # CLASS 2    # CLASS 3    # CLASS 4

0000               18            0           33            6
0001               18            0           33            6
0010               18            0           33            6
0011               18            0           33            6
0100               18            0           33            6
0101               18            0           33            6
0110               18            0           33            6
0111               18            0           33            6
1000               18            0           33            6
1111               18            0           33            6

Average            18            0           33            6

Table 3. Input or output faults

Combining Tables 2 and 3 we get the following distribution of faults:

Class 1    43.88%
Class 2     1.66%
Class 3    46.11%
Class 4     8.33%

Table 4.
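
The percentages in Table 4 follow from adding the average fault counts of
Tables 2 and 3 class by class and dividing by the combined total of 72
cases. That weighting (every individual fault case counted equally across
both tables) is inferred from the numbers rather than stated in the text,
and the names below are illustrative.

```python
SELECT_CODE_AVG = (13.6, 1.2, 0.2, 0.0)  # Table 2 averages, Class 1..4
LINE_FAULT_AVG = (18.0, 0.0, 33.0, 6.0)  # Table 3 averages, Class 1..4


def class_shares(*tables):
    """Percentage of all fault cases that fall in each class."""
    combined = [sum(values) for values in zip(*tables)]
    total = sum(combined)
    return [100.0 * count / total for count in combined]


shares = class_shares(SELECT_CODE_AVG, LINE_FAULT_AVG)

if __name__ == "__main__":
    for number, share in enumerate(shares, start=1):
        print(f"Class {number}: {share:.2f}%")
```

Rounded to two places this gives 43.89% and 1.67% for Classes 1 and 2,
where the table quotes the truncated values 43.88% and 1.66%.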

Class 1 faults will be detected between 0-320 msec. Class 2 faults here would
require the error-latch and voter self tests. Based on the assumptions made
earlier, the Class 2 faults will be detected between 0-12.16 sec.

Class 3 fault detection in the case of the 3/5 select unit requires cycling
spare lines. From Figure 5b, the time between two bus swap commands is 140
secs. This means that half of the Class 3 errors could be detected in a maximum
time of 140 secs and the rest after the next 140 secs. This case is plotted in
Figure 8.

We summarize the assumptions for Figure 8: 

1. The faults are equally likely input select code errors and equally 
likely bus input and select output stuck errors. 

2. The select unit is not in a BGU but is equally likely to be in any
other location.

3. Spare cycling and self-test run at their most frequent rate, i.e., 
the system is lightly loaded. 
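
Under these assumptions, the worst case detection times quoted above can be
combined with the class shares into a simple step curve. This sketch is not
a reproduction of Figure 8; it only accumulates the upper bounds stated in
the text (320 msecs, 12.16 secs, 140 secs, and 280 secs). The Class 3 share
of 46.11% is the value recomputed from Tables 2 and 3, and the names SHARES
and detected_by are illustrative.

```python
# Class shares in percent (Table 4); Class 4 faults are never detected.
SHARES = {1: 43.88, 2: 1.66, 3: 46.11, 4: 8.33}

R1_PERIOD_S = 0.320         # Class 1: caught within one SCC frame
SELF_TEST_CYCLE_S = 12.16   # Class 2: caught within one self test cycle
BUS_SWAP_INTERVAL_S = 140   # Class 3: half per bus swap interval


def detected_by(t_s):
    """Percentage of faults certain to be detected by time t_s (seconds)."""
    total = 0.0
    if t_s >= R1_PERIOD_S:
        total += SHARES[1]
    if t_s >= SELF_TEST_CYCLE_S:
        total += SHARES[2]
    if t_s >= BUS_SWAP_INTERVAL_S:
        total += SHARES[3] / 2
    if t_s >= 2 * BUS_SWAP_INTERVAL_S:
        total += SHARES[3] / 2
    return total


if __name__ == "__main__":
    for t in (0.1, 1, 13, 140, 280):
        print(f"t = {t:>5} s: at least {detected_by(t):.2f}% detected")
```

The curve levels off below 100% because the Class 4 share is never
detected.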

        Triad 1          Triad 2          Triad 3

A        1  2  3 (10)     4  5  6 (11)     7  8  9 (12)
B       10  2  3 (1)*     4  5  6 (11)     7  8  9 (12)
C       10  2  3 (1)     11  5  6 (4)*     7  8  9 (12)
D       10  2  3 (1)     11  5  6 (4)     12  8  9 (7)*
E       10  1  3 (2)*    11  5  6 (4)     12  8  9 (7)
F       10  1  3 (2)     11  4  6 (5)*    12  8  9 (7)
G       10  1  3 (2)     11  4  6 (5)     12  7  9 (8)*
H       10  1  2 (3)*    11  4  6 (5)     12  7  9 (8)
I       10  1  2 (3)     11  4  5 (6)*    12  7  9 (8)
J       10  1  2 (3)     11  4  5 (6)     12  7  8 (9)*
K        1  2  3 (10)*   11  4  5 (6)     12  7  8 (9)
L        1  2  3 (10)     4  5  6 (11)*   12  7  8 (9)
M        1  2  3 (10)     4  5  6 (11)     7  8  9 (12)*

* Swap of shadow and active unit

Figure 5(a). Processor spare cycling assuming 12 processors.
One major cycle.

Figure 5(b). [Figure not reproduced; surviving label: Any Detectable Fault]

LATENCY

We next relate the functional fault classes and latency estimates of the
3/5 select unit to a system model. Faults have significantly different
effects on system behavior. For example, a long latency fault in a nonactive
processor does not influence the system until it is reconfigured as an active
unit, but the failure of an active P bus line has immediate consequences.
Hence the functional fault classes ought to be of sufficient detail to allow
appropriate assignment to the various transitions in the system reliability
model (e.g., CARE III).

The classification used in arriving at Tables 2, 3, and 4 would be
applicable to an FTMP system reliability model which distinguished between
faults in active processors and memories and spare processors and memories.
But the classification is not of sufficient detail if BGU behavior is
explicitly included in the reliability model. Specifically, the functional
behavior MB1 (selecting less than 3 active lines) should be subdivided for the
BGU into the selection of 2 active lines and less than two active lines. The
BGU behavior differs for these two subcases.

Suppose the reliability model does distinguish between active processors
and spare processors. Then the classes in Figure 8 would probably be assigned
as follows:

1. Class 1 and Class 2 to transitions for a fault in an active
processor.

2. Class 3 to transitions for a fault in a spare processor.

3. Class 4 to reduce the occurrence probability of a fault.


MODELING PROCESS 


The previous sections have attempted to illustrate the functional fault
modeling process when applied to a particular unit for the estimation of fault
latency. The following factors are important in this modeling.

1. The detail known about the physical devices and structure of the
implementation and the possible fault mechanisms.

2. The hardware and software structure used to detect that a fault has
occurred: what redundancy exists in space and time, and how it is
used to detect faults.

3. The proposed use to be made of the latency estimates and the level of
detail about fault behavior that is required, for example, the
determination of worst case behavior.